KDD Project Report Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories
Author
Abstract
We investigate the use of Error-Correcting Output Codes (ECOC) for efficient text classification with a large number of categories and propose several extensions that improve the performance of ECOC. ECOC has been shown to perform well on classification tasks, including text classification, but it remains an under-explored area of ensemble learning. We explore error-correcting codes that are short (minimizing computational cost) yet yield highly accurate classifiers for several real-world text classification problems. Our results also show that ECOC is particularly effective for high-precision classification. In addition, we develop modifications that make ECOC more accurate, such as assigning codewords to categories according to their confusability and learning the decoding function (the way the decisions of the individual classifiers are combined) so that it adapts to different datasets. To reduce the need for labeled training data, we develop a framework for ECOC in which unlabeled data can be used to improve classification accuracy. This research will impact any area where efficient classification of documents is useful, such as web portals and information filtering and routing. It is especially relevant to open-domain applications, where the number of categories is usually very large, new documents and categories are constantly being added, and the system needs to be very efficient.
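To make the coding and decoding steps concrete, the following is a minimal sketch of ECOC decoding by Hamming distance. It is an illustration under stated assumptions, not the method evaluated in the report: the four category names, the 7-bit code matrix, and the helper names `hamming`, `min_distance`, and `decode` are all hypothetical, and the bit predictions are hard-coded where a real system would use one trained binary classifier per code column.

```python
from itertools import combinations

# Hypothetical code matrix: one 7-bit codeword per category.
# Each column defines one binary problem (which categories get label 1).
CODEWORDS = {
    "sports":   (1, 1, 1, 1, 1, 1, 1),
    "politics": (0, 0, 0, 0, 1, 1, 1),
    "business": (0, 0, 1, 1, 0, 0, 1),
    "science":  (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords):
    """Smallest pairwise Hamming distance between any two codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords.values(), 2))


def decode(bits, codewords=CODEWORDS):
    """Return the category whose codeword is closest to the predicted bits."""
    return min(codewords, key=lambda c: hamming(codewords[c], bits))


# Simulate a document from "politics" where one binary classifier is wrong.
observed = list(CODEWORDS["politics"])
observed[2] ^= 1  # flip the output of the third (hypothetical) classifier
predicted = decode(tuple(observed))
```

With a minimum pairwise distance of d, nearest-codeword decoding tolerates up to floor((d - 1) / 2) wrong binary decisions, so longer codes buy robustness at the cost of training more classifiers. The learned decoding mentioned in the abstract would replace the fixed `decode` rule here; this sketch does not attempt it.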
Similar resources
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Beyond its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Using Error-Correcting Codes with Co-Training for text classification with a large number of categories
A major concern with supervised learning techniques for text classification is that they often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn effectively from a small number of labeled examples augmented with a large number of unlabeled examples. In this paper, we develop a framework t...
Using Error-Correcting Codes for Text Classification
This paper explores in detail the use of Error Correcting Output Coding (ECOC) for learning text classifiers. We show that the accuracy of a Naive Bayes Classifier over text classification tasks can be significantly improved by taking advantage of the error-correcting properties of the code. We also explore the use of different kinds of codes, namely Error-Correcting Codes, Random Codes, and Do...
Multi-class Classification with Error Correcting Codes
Automatic text categorization has become a vital topic in many applications. Imagine for example the automatic classification of Internet pages for a search engine database. The traditional 1-of-n output coding for classification scheme needs resources increasing linearly with the number of classes. A different solution uses an error correcting code, increasing in length with O(log2(n)) only. I...
Multi-class Text Categorization with Error Correcting Codes
Automatic text categorization has become a vital topic in many applications. Imagine for example the automatic classification of Internet pages for a search engine database. The traditional 1-of-n output coding for classification scheme needs resources increasing linearly with the number of classes. A different solution uses an error correcting code, increasing in length with O(log2(n)) only. In t...